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Abstract 

Despite playing a crucial role in germline-soma differentiation, the evolutionary significance of developmentally regulated genome 
rearrangements (DRGRs) has received scant attention. An example of DRGR is DNA splicing, a process that removes segments of DNA 
interrupting genie and/or intergenic sequences. Perhaps, best known for shaping immune-system genes invertebrates, DNA splicing 
plays a central role in the life of ciliated protozoa, where thousands of germline DNA segments are eliminated after sexual repro- 
duction to regenerate a functional somatic genome. Here, we identify and chronicle the properties of 5,286 sequences that putatively 
undergo DNA splicing (i.e., internal eliminated sequences [lESs]) across the genomes of three closely related species of the ciliate 
Paramecium {P. tetraurelia, P. biaurelia, and P. sexaurelia). The study reveals that these putative lESs share several physical charac- 
teristics. Although our results are consistent with excision events being largely conserved between species, episodes of differential lES 
retention/excision occur, may have a recent origin, and frequently involve coding regions. Our findings indicate interconversion 
between somatic — often coding — DNA sequences and noncoding lESs, and provide insights into the role of DNA splicing in creating 
potentially functional genetic innovation. 

Key words: ciliated protozoa, developmentally regulated genome rearrangements, DNA splicing, internal eliminated 
sequences, genome evolution. 



Introduction 

A fundamental goal in evolutionary biology is to identify and 
understand the processes that, by shaping genome content 
and architecture, give rise to evolutionary novelties. 
Developmentally regulated genome rearrangements (DRGRs) 
are excellent candidates for generating evolutionary innova- 
tion. DRGRs, such as chromatin/chromosome elimination and 
chromatin diminution, take place during germline-soma dif- 
ferentiation and can create vast diversity in genome architec- 
ture, often producing significant changes in the fate of the 
affected cells (Kloc and Zagrodzinska 2001 ; Zufall et al. 2005). 

An intriguing example of DRGR is DNA splicing, a process 
that removes segments of DNA interrupting genie and/or 
intergenic sequences. Perhaps, best known for initiating the 
rearrangement of immunoglobulin and T-cell receptor genes 
in vertebrates during the differentiation of lymphocytes 
(Fugmann et al. 2000), DNA splicing also mediates a more 
dramatic example of chromatin diminution in ciliated proto- 
zoa. In these single-celled organisms, the whole genome is 



subject to extensive elimination of noncoding internal elimi- 
nated sequences (lESs) (Mochizuki and Gorovsky 2005; 
Kowalczyk et al. 2006; Gratias et al. 2008; Lepere et al. 
2008; Kapusta et al. 2011). Such processing follows sexual 
reproduction and is possible because of the binucleate 
nature of ciliate cells. Briefly, the diploid, germline micronu- 
cleus contains the entire genome and is transcriptionally silent; 
after meiosis and syngamy, the micronuclear DNA regenerates 
a new somatic macronucleus, where all gene expression 
occurs, while the old, maternal macronucleus degrades. 
Together with major events including chromosome fragmen- 
tation and genome amplification, lES elimination crucially 
contributes to the development of the new macronuclear 
DNA. 

In the ciliate Paramecium, lESs are commonly short, AT-rich, 
and flanked by 5'-TA-3' dinucleotide repeats that are in turn 
part of larger 8-bp imperfect terminal inverted repeats 
(Betermier 2004). Initially observed for a limited number of 
lESs isolated from the micronuclear genome, these 
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characteristics were later corroborated by a study where the 
isolation of micronuclear DNA, a laborious and problematic 
step, was elegantly overcome (Duret et al. 2008). Specifically, 
because lES excision in Paramecium is not entirely faithful and 
is preceded by a polyploidization stage (Betermier et al. 2000), 
Duret et al. (2008) were able to detect hundreds of TA-flanked 
sequences that are imperfectly excised from a few copies 
of the polyploid macronuclear genome of a P. tetraurelia 
clone. The vast majority (93%) of these TA-indels consists of 
true lESs, as shown by the recently published P. tetraurelia 
germline genome sequence (Arnaiz et al. 2012). Thus, while 
lES splicing in Paramecium is typically precise, events of incom- 
plete excision occur. This excision error is of a nontrivial 
magnitude, as it can amount to more than 1 in 10 lESs 
(Duret et al. 2008). 

Because unfaithful lES excision generates distinct DNA 
isoforms, one may ask: do some of these newly generated 
DNA isoforms acquire biological significance? In a hypothet- 
ical but verifiable evolutionary scenario, some events of im- 
perfect lES excision would become established within the 
macronucleus of an individual. When nonlethal and herita- 
ble, these lES-retaining somatic alleles can spread through 
the sexual population with some probability of fixation. This 
is plausible in the light of our current understanding of 
the lES excision mechanisms as maternally determined. 
Specifically, in addition to being mediated by c/s-acting sig- 
nals located at the lESs' 3'- and 5'-termini, lES excision in 
Paramecium is also epigenetically regulated. The presence of 
an lES in the maternal (prezygotic) macronucleus — which 
gradually disappears from the cytoplasm after sexual repro- 
duction — can inhibit its own excision from the next gener- 
ation's macronuclear genome (Duharcourt et al. 1995). This 
homology-dependent maternal inhibition may be limited to 
a subset of maternally controlled lESs (Duharcourt et al. 
1998). Thus, a sufficiently high number of imperfect lES 
excision events in the maternal macronucleus can lead 
to IBS retention in the next generation's macronuclear 
genomes (Duharcourt et al. 1995). 

In this study, we build on the findings of Duret et al. (2008) 
in P. tetraurelia, to extend the identification of putative lESs to 
two additional moderately divergent P. aurelia species 
{P. biaurelia and P. sexaurelia) (Catania et al. 2009) whose 
somatic genomes have been sequenced (currently unpub- 
lished). In addition to determining physical characteristics 
shared by the three species-specific sets of TA-indels, we iden- 
tify elements that are differentially excised between (and 
within) species, and provide examples of putative conversion 
of macronuclear-destined sequences into lESs. The existence 
of variability in the pattern of lES excision among closely re- 
lated species of ciliated protozoa that is indicated by these 
results provides new insights into the roles that DRGRs — 
DNA splicing in particular — have in the evolution of genome 
architecture and gene structure in ciliates. 



Materials and Methods 

Treatment of Sequence Reads, Mapping Procedure, and 
Extraction of TA-lndel Sequences 

Assembled macronuclear genome sequences and correspond- 
ing raw sequence reads and quality data were collected from 
three Paramecium species, P. tetraurelia, P. biaurelia, and 
P. sexaurelia, all belonging to the P. aurelia species complex 
(Sonneborn 1975). The data files were downloaded from 
ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 
2011) for P. tetraurelia, and generated at Indiana University 
for P. biaurelia and P. sexaurelia (McGrath CL, Doak TG, Lynch 
M, unpublished data). Cultures of P. biaurelia and P. sexaurelia 
(stock VI -4 and AZ8-4, respectively, from the collection of A. 
Potekhin) were grown to saturation in Wheat Grass Powder 
(Pines International, Lawrence, KS) medium previously inocu- 
lated with /^erofaacfer aerogenes. Relatively pure macronuclei 
(free of bacterial, mitochondrial, and micronuclear DNA con- 
tamination) were obtained by the method of Aury et al. 
(2006). An almost complete absence of reads for the mito- 
chondrial genome supports the efficacy of the macronuclear 
purification. Macronuclear DNA was further purified by the 
CTAB method (Ausubel et al. 1994). Libraries for lllumina and 
454 (8 kb inserts) sequencing were generated at the Center 
for Genomics and Bioinformatics (Indiana University). The av- 
erage genomic coverage for P. biaurelia and P. sexaurelia is 
approximately 45 X and 42 x, respectively. 

To identify putative lESs for each of the three Paramecium 
species, we designed a bioinformatics pipeline that screens the 
collections of macronuclear sequence reads for incorrect exci- 
sion events, an approach used in a recent study by Duret et al. 
(2008). Specifically, raw sequence read ends were trimmed 
until the terminal sites (grouped in windows of 30 nucleotides) 
showed an average quality score higher than 10. Nucleotides 
in sequence reads were masked when their quality score was 
less than 1 0 or when they matched the P. tetraurelia telomeric 
repeat, that is, three repeats of the hexanucleotide 
CCC[CA]AA, with at most one mismatch. Trimmed and 
masked sequence reads were subsequently mapped to the 
corresponding macronuclear genome assembly using BLAT 
(Kent 2002). The alignment quality of retained BLAT hits 
was improved using the MUSCLE program (Edgar 2004) and 
indels were treated as bona fide lESs (hereafter referred to 
simply as lESs) if: 1) the length of the hit genomic region 
covers >50% of the read length (masked nucleotides not 
included); 2) the hit genomic region contains an insertion-de- 
letion (indel) whose size falls between 1 0 and 5,000 bp; 3) the 
sequence read maps to a unique location; 4) the indel is im- 
mediately flanked by a TA dinucleotide — which is known to 
flank lESs in several Paramecium species — and preceded or 
followed by at least 30 ungapped and unmasked sites; 5) 
the indel is included in an alignment that contains at most 
two >10-nt gaps and where the overall level of sequence 
identity is >95%. A summary of the key bioinformatics 
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steps, including the count of sequence reads after the filtering 
step and the total number of putative lESs collected after the 
realignment step is presented in supplementary table SI, 
Supplementary Matenal online. 

We additionally used BLAT to extract the total number of 
gapped and ungapped sequence reads that map uniquely 
onto the detected lES loci. For this analysis, merged lES-flank- 
ing regions (40 nucleotides [nt] from each end) and lES-flank- 
ing regions (40 nt from each end) plus the intervening lES 
were used to query the collection of reads from the corre- 
sponding species. 

Characterization of lESs 

As in Duret et al. (2008) our approach detected two types of 
TA-flanked sequences: 1 ) present in the genome assembly but 
excised from one or more sister reads (we call these sequences 
cryptic lESs), and 2) absent from the genome assembly but 
inserted in one or more reads (we call these sequences incom- 
pletely excised lESs). 

Note that some incompletely excised lESs may be spurious 
because of reads arising from contaminating micronuclear 
DNA. Although incompletely excised lESs and lESs as micro- 
nuclear DNA contaminants cannot be distinguished in our 
study, we expect the latter type of lESs to be uncommon for 
at least two reasons: 1) macronuclear DNA enrichment treat- 
ments were adopted before genome sequencing for each of 
the three species surveyed (as a result, the sequence reads of 
the three Paramecium species rarely match mitochondrial 
DNA), and 2) in P. aurelia cells, the macronuclear DNA 
(-800 haploid copies) is approximately 200 times more abun- 
dant than the micronuclear DNA (four haploid copies). In 
fact, for P. tetraurelia, micronuclear DNA contamination has 
been suggested not to exceed 50 in 10® reads (Duret et al. 
2008). 

Although most characterization used custom perl scripts, 
the degree of intraspecies conservation of 8-bp sequences at 
the lES termini and the 11 -bp regions immediately flanking 
the lESs was analyzed using RNA Structure Logo (Schneider 
and Stephens 1990; Gorodkin et al. 1997). The logo genera- 
tion process takes into account the nucleotide distribution in 
each of the three P. aurelia species' lES data set. In the result- 
ing logo, the height of each nucleotide is proportional to its 
observed frequency relative to the expected frequency — 
nucleotides whose frequency is less than expected are dis- 
played upside down. The following options were used in 
RNA Structure Logo: plain sequence, no field assignment, 
and logo type 2. 

Reciprocal Mapping of lESs among Species and Retrieval 
of Orthologs and Paralogs 

After collecting and characterizing physical characteristics of 
the lESs for each of the three P. aurelia species, we examined 
the location of the lESs. As the P. tetraurelia genome has been 



annotated, the location of lESs in this species was inferred by 
comparing the scaffold coordinates of the lESs with those of 
the annotated genes. To infer the location of the lESs detected 
for the remaining two species, whose genomes have not yet 
been annotated, we took advantage of the annotation of the 
P. tetraurelia genome. Specifically, we used the BLAST pro- 
gram (Altschul et al. 1997) to map the sequence reads asso- 
ciated with the lESs in P. biaurelia and P. sexaurelia to the 
P. tetraurelia genome. All regions sharing significant sequence 
similarity between species (E value = 10^^) were retrieved but 
only BLAST best hits were used to infer lES location. In cases 
where BLAST best hits only partially covered the sequence 
reads and did not encompass the lES, the coordinates of the 
mapped sequence were used to extrapolate the lES position. 

BLAST was also employed to determine the absence (or 
presence) of the lESs detected in one P. aurelia species in 
the macronuclear DNA of the remaining species. Three inde- 
pendent BLAST input sequences were used: 1) merged lES- 
flanking regions (40 nt from each end); 2) lES-flanking regions 
(40 nt from each end) plus the intervening lES; and 3) lES se- 
quences only. BLAST hits produced by the first set of input 
sequences reveal the absence of lESs in the macronuclear 
genome of a given P. aurelia species. Hits were only accepted 
if their coverage extends for more than 3/4 the length of the 
query sequence (i.e., >60bp). BLAST hits produced by the 
second and third sets of input sequences uncover events of 
lES retention. These BLAST hits are expected to partially over- 
lap. Hits produced by the second set of input sequences were 
accepted (and visually examined) if at least a 20-nt tract of the 
lES detected in one species matched the macronuclear 
genome sequence of either of the other two species. Finally, 
the BLAST hits obtained for the first and second set of input 
sequences were used to calculate the overall average fraction 
of differentially excised lESs between the species. 

The detection of variably excised lESs was followed by the 
(BLAST-mediated) retrieval of all the detectable DNA regions 
that are orthologous or paralogous to the region encompass- 
ing the polymorphic lES across the three P. aurelia species. The 
DIALIGN sequence alignment program (Morgenstern 1999) 
and manual editing were used to generate accurate sequence 
alignments, which were visualized with BioEdit (Hall 1999). 
The Maximum Composite Likelihood model implemented in 
MEGA 4.0 and bootstrap statistics (1,000 replicates) were ap- 
plied to all the sites of the examined regions to build Neighbor- 
Joining trees (Tamura et al. 2007). 

Results 

By mapping sequence reads of each Paramecium species onto 
the corresponding genome assembly, we identified 2,345 
putative lESs in P. tetraurelia, 1,538 in P. biaurelia and 1,403 
in P. sexaurelia (table 1 ). The putative lESs that we detected in 
this study (hereafter referred to as simply lESs) can be differ- 
entiated into two classes: 1) incompletely excised lESs 
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Table 1 

Summary of lES Counts, Types, AT-Contents, and Genomic Locations 

Paramecium tetraurelia P. biaurelia P. sexaurelia 



Incompletely Cryptic Incompletely Cryptic Incompletely Cryptic 





Excised 




Excised 




Excised 




Nonoverlapping, single-copy lESs 


875 


1,470 


398 


1,140 


462 


941 


AT richness 


0.76 ±0.08 


0.78 ± 0.08 


0.81 ±0.07 


0.7910.08 


0.81 ±0.07 


0.80 ± 0.07 


Number (percentage) of lESs 


276 (32%) 


471 (32%) 


117 (29%) 


378 (33%) 


137 (30%) 


297 (32%) 


with size multiple of 3 (i.e., 3n) 














Location in the genome" 






All lESs (3n lESs) 






Coding exon 


464 (141) 


370 (99) 


194 (67) 


325 (111) 


226 (63) 


180 (47) 


UTR 


9 (5) 


8(2) 


9 (1) 


7(1) 


10(3) 


2(0) 


Intron 


29 (5) 


4(1) 


12 (3) 


1 (0) 


21 (8) 


5(2) 


Intergenic 


373 (125) 


788 (274) 


45 (8) 


186 (47) 


45 (17) 


111 (40) 



^Based on the P. tetraurelia genome annotation. Estimates for cryptic lESs refer to elements that overlap with only one of the four classes of DNA sequence. 



(i.e., sequences absent fronn the somatic genome assembly 
but found in one or more reads) and 2) cryptic lESs (i.e., se- 
quences present in tlie somatic genome assembly but excised 
from one or more sister reads). Using tlie recently publisiied 
P. tetraurelia germline DNA sequence (Arnaiz et al. 2012), we 
could verify that approximately 90% of the incompletely ex- 
cised lESs are either full-length (73%) or fragmented (17%) 
lESs. The identification of about half as many putative ele- 
ments in P. biaurelia and P. sexaurelia as in P. tetraurelia is 
likely due to the much shorter average read length produced 
by the technology employed to sequence the P. biaurelia and 
P. sexaurelia somatic genomes (supplementary table SI, 
Supplementary Material online). The surveyed lESs are non- 
overlapping and single-copy. The AT-richness in these se- 
quences is 4-5% higher than the average AT-content of the 
macronuclear genome (Aury et al. 2006; McGrath CL, Doak 
TG, Lynch M, unpublished data). Also, the average AT-content 
of P. biaurelia and P. sexaurelia lESs is 4-5% higher than that 
of P. tetraurelia lESs (supplementary tables S2A and S2B, 
Supplementary Material online). The lES size distribution is 
similar to that reported for lESs in P. tetraurelia (Gratias and 
Betermier 2001; Arnaiz et al. 2012) and comparable across 
the three species, with one major mode at approximately 
27 bp and a minor mode at approximately 44 bp followed 
by a long tail (fig. 1). Based on the published P. tetraurelia 
somatic genome annotation (for the remaining two genomes 
the process of annotation is currently ongoing), we conclude 
that all three species-specific sets of lESs are located both in 
coding and noncoding regions (table 1). 

Inverted Repeats and Conserved Flanking Motifs of the 
Detected lESs 

The consensus of the 5'- and 3'-ends of lESs in Paramecium 
(5'-TAYAGYNR-3') resembles the termini of Tcl/mariner trans- 
posons (Klobutcher and Herrick 1 995). As in previous work. 



a large fraction of the extremities of lESs identified in this study 
produce a similar consensus (fig. 2). When the two classes of 
lESs are considered, incompletely excised lESs share the canon- 
ical consensus, whereas the termini of cryptic lESs display a 
much lower level of conservation (fig. 2; supplementary tables 
S2A and S2B, Supplementary Material online). The enrich- 
ment (or deficit) of particular nucleotides in the lES termini is 
comparable across the three species and mainly involves po- 
sitions 3, 4, and 5 (after the conserved TA dinucleotide). Also, 
when the two ends of lESs are compared, positions 3, 4, and 5 
are significantly more conserved than positions 6, 7, and 8 
(supplementary table S3, Supplementary Material online). 
This conserved pattern of nucleotide enrichment suggests 
that lES termini — particularly the nucleotides at positions 3, 
4, and 5 — play a function associated with presumably 
equivalent mechanisms of lES recognition/excision across 
P. aurelia species. 

An exploratory search for further c/s-regulatory excision 
signals — in addition to the crucial TA dinucleotide (Mayer 
and Forney 1999; Matsuda et al. 2004)— revealed a distinct 
and recurrent (about twice expected) 5'-GG|CC-3' motif 
(where | indicates the intervening lES) flanking incompletely 
excised lESs in each of the three P. aurelia species (supple- 
mentary fig. SI A, Supplementary Material online). Although 
less pronounced, an enrichment of Gs and Cs is apparent 
around cryptic lESs as well (supplementary fig. SIS, 
Supplementary Material online). A similar observation had 
been reported for a limited number of lESs in P. tetraurelia 
(Gratias and Betermier 2003) and suggests that these c/s sites 
too may be involved in the process of lES definition or exci- 
sion. A hypothesis to explain such involvement — other than 
possibly facilitating the generation of excision-associated 
loop structures — is that an overrepresentation of guanines 
and cytosines around lESs in an AT-rich genome signals the 
presence of an lES to the excision machinery. 
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Fig. 1. — lES size distribution. Size distribution (in base pairs) of tlie two sets of putative lESs (i.e., imperfectly excised and cryptic lESs) detected for 
Paramecium tetraurelia, P. biaurelia, and P. sexaurelia. 



Constraints on Distribution of lESs within Genes 

Analyses of spliced RNA and DNA sequences have detected 
selective constraints on the evolution of coding regions (Duret 
et al. 2008; Jaillon et al. 2008). Briefly, both spliceosomal in- 
trons and lESs in coding sequences appear to share patterns of 
selection to provide either in-frame stop codons (premature 
translation termination codons or PTCs) or frame shifts 
(length / a multiple of 3 (3n), which will quickly result in trans- 
lation termination). The most likely hypothesis to explain such 
selection is that in the event of (accidental) retention in tran- 
scripts, noncoding sequences without these features would be 
invisible to the cellular surveillance systems and can generate 
toxic products. 

We verified the existence of similar selective constraints 
in our data set. To this end, we studied the overall number 
of 3n lESs and the relative abundance of PTC -free and 



PTC -containing lESs in the genome of P. tetraurelia. As previ- 
ously reported (Duret et al. 2008), we found that fewer 3n 
(cryptic) lESs than expected by chance map to coding 
sequences (x^ = 6.47, df=1, P= 0.011), and fewer 3n 
PTC -free (incompletely excised) lESs than expected map to 
coding sequences (x^ = 4.74, df=1, PpTc-free-3n = 0.029; 
X^=1.34, df=1, PpTc-containing-3n = 0.246) (supplementary 
table S4, Supplementary Material online). We detected no 
deviation from expected frequencies within intergenic regions 

(X^ = 0.049, df= 1, Pimperfectly excised IESs = 0.826; x'= 1-125, 
df=1, Pcryptic IESs = 0.289). 

In addition, we asked: how is a selective response triggered 
by imperfect lES excision events? Current observations sug- 
gest that imperfect lES excisions typically involve only a small 
fraction (typically -1/13) of the highly polyploid (~800n) so- 
matic genome. It is unlikely that this limited fraction of 
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Fig. 2. — Nucleotide composition and frequency of lES termini consen- 
sus sequences. Nucleotide composition and frequency of the consensus 
sequences of the 8-bp imperfect terminal inverted repeat (IR) of lESs 
detected for Paramecium tetraurelia, P. biaurelia, and P. sexaurelia. 
Nucleotide frequency is expressed as the proportion between the observed 
frequency relative to the expected frequency (the expected frequency is 
calculated on the basis of the average nucleotide composition of the spe- 
cies-specific set of lESs). 



lES-retaining sonnatic DNA copies has major effects on the 
fitness of the carrier. Thus, we nnust conclude that imperfect 
excisions may occasionally involve relatively large fractions of 
the somatic DNA copies: fractions that are large enough to 
impinge on fitness and trigger an adequate selective response. 
Under this selective scenario, imperfectly excised lESs that 
reside within genes may exhibit a positional trend that 
favors their elimination. Specifically, a PTC erroneously intro- 
duced in a transcript and proximal to the translation start site 
leads to a more rapid interruption of the expensive process of 
protein synthesis and would be most efficiently recognized by 
the nonsense-mediated decay (NMD) system (van Hoof and 
Green 1 996; Silva et al. 2006; Longman et al. 2007). Although 
the NMD efficiency of PTC recognition has not yet been in- 
vestigated in Paramecium, if NMD operates in Paramecium as 
it does in other eukaryotic systems we may expect to observe 
the following: 1) PTC -free non-3n and PTC -containing lESs 
accumulating at the 5'-ends of genes and 2) no positional 
bias for stopless 3n lESs, as their retention is invisible to NMD. 



In our data set, the location of lESs within P. tetraurelia 
genes is indeed biased toward the 5' -end: there is a 5'-end 
bias for both 3n and non-3n lESs (3n lESs: r= -0.5270, P 
value = 0.0588; non-3n lESs: /-= -0.8544, P value = 0.0008), 
whereas a separate analysis of 3n vs. non-3n lESs reveals that 
the preferential location near a gene's 5'-end involves PTC- 
containing lESs and PTC -free non-3n lESs. No significant trend 
is observed, as expected, for PTC -free 3n lESs (fig. 3). 

Overall, these results are consistent with a scenario in which 
1) the per-locus fraction of inaccurate lES excisions occasion- 
ally involves a number of somatic DNA copies that is suffi- 
ciently large to affect individual fitness and 2) transcripts 
that incorporate lESs are efficiently degraded. Despite this in- 
dication of efficient counter-selection, our results suggest that 
the incorporation of lESs into coding sequences occasionally 
takes place. These infrequent events, which could have sub- 
stantial evolutionary implications, are examined below. 

Most Putative (Imperfectly Excised) lESs in One P. aurelia 
Species Are Absent from the Macronuclear Genome of 
the Remaining Species 

The availability of the macronuclear genome sequences of 
three closely related species of ciliates in tandem with a rele- 
vant collection of lESs provides an excellent opportunity for 
investigating the contribution of lESs (i.e., germline-specific 
sequences) to somatic genome content. Specifically, as a 
result of mutations in excision signals, some lESs could 
become integral parts of the somatic DNA of an individual 
and, when nonlethal and heritable, the lES-containing alleles 
could spread and fix in the species. Under this scenario, the 
alignment of homologous somatic DNA sequences of closely 
related species would display insertion-deletion polymor- 
phisms associated with lES retention. 

To gain insights into this hypothetical scenario, we asked if 
imperfectly excised lESs in one P. aurelia species are absent 
from macronuclei of the other P. aurelia species. This analysis 
showed that the majority of incompletely excised lESs in a 
given P. aurelia species are indeed absent from the putative 
orthologous (or least divergent homologous) loci of the other 
two species (fig. 4; supplementary table S5, Supplementary 
Material online). lESs can be absent from the somatic genome 
assemblies for two reasons: they are excised from the corre- 
sponding germline DNA, or they are absent from the germline 
DNA. In the former case, one should be able to detect the 
signature of this excision, the 5'-TA-3' dinucleotide, known to 
remain in the somatic DNA after lES removal. Indeed, in 97- 
100% of the cases, a 5'-TA-3' is conserved at the expected 
position in between-species comparisons. Although it is diffi- 
cult to draw firm quantitative conclusions from these obser- 
vations, these results are consistent with most lES excision 
events being conserved across the three P. aurelia species. 

Most cryptic lESs, on the contrary, are detected in the mac- 
ronuclear DNA of the other two species (fig. 4; supplementary 
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Fig. 3. — Position of lESs in coding sequences. Correlation analyses between number of exon-mapping lESs (partitioned according to size and presence of 
PTCs) and lES distance from the gene start in Paramecium tetraurelia. Distance values are allocated in 10 bins and are equal to the ratio between the IBS 
distance from the gene start and the total gene length. 



table S5, Supplementary Material online). This observation is 
consistent with the majority of cryptic lESs being the product 
of erroneous excision, presumably resulting from a partial re- 
semblance of nearby sequences to canonical recognition or 
excision signals (fig. 2; supplementary fig. SI 6 and table S2B, 
Supplementary Material online). 

lES Excision Variability between P. aurelia Species 

We also detected some cases of variability in lES excision 
among the three P. aurelia species (fig. 4; supplementary 
table S5, Supplementary Material online). For example, 
when the 875 P. tetraurelia incompletely excised lESs were 
used (with or without additional flanking regions) to query 
the genome of the two remaining species, 14 and 4 of 
these lESs were found to match significantly (E value = 10"^) 
and uniquely with the P. biaurelia and P. sexaurelia assembled 
macronuclear genomes, respectively. This suggests that a frac- 
tion of the lESs detected in a given P. aurelia species may no 



longer be (or may have never been) excised from the macro- 
nuclear genome of other species of P. aurelia. 

Cases of lES presence-absence polymorphisms involve 
cryptic lESs as well. For example, when 40-bp regions imme- 
diately flanking the 1470 P. tetraurelia cryptic lESs were 
merged and used to query the other species' genome se- 
quences, 10 and 9 have hits in the P. biaurelia and the 
P. sexaurelia genomes, respectively. This latter observation is 
intriguing, suggesting that regions rarely excised from the 
macronuclear DNA of one P. aurelia species can be faithfully 
excised from homologous loci in closely related species. 

Could the observed lES excision polymorphisms reflect mis- 
assembly errors due to micronuclear contamination? In prin- 
ciple, if a collection of reads mapping onto a genomic region 
consists of both gapped and un-gapped sequences (e.g., mac- 
ronuclear and micronuclear sequences, respectively), errone- 
ous information could be incorporated during the assembly 
process, and, as a result, a comparison between two or more 
genomes would exhibit artificial insertion/deletion polymor- 
phisms. These sequence misincorporation events would be 
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Fig. 4. — Conservation and variability of lES excision between Paramecium aurelia species. The BLAST program is employed to determine whether 
imperfectly excised lESs and cryptic lESs detected in one P. aurelia species (query species) are present or absent from the macronuclear genome assembly of 
the remaining species (database species). We screen the genome assembly of the database species using: 1) merged lES-flanking regions (40 nt from each 
end); 2) lES-flanking regions (40 nt from each end) plus the intervening lES; and 3) lES sequences only. The graph represents the cumulative number of hits in 
the genome assembly of the database species. 



expected to be nnore common for genomic regions that are 
supported by only a few reads. Also, the lil<elihood of DNA 
region misincorporation would be expected to be the highest 
when the ratio between gapped and ungapped reads 
approaches 50%. To reduce this type of inaccuracy, we fil- 
tered our data set for genomic regions covered by at least 
eight reads and whose current configuration is supported by 
a fraction of reads higher than 70%. In addition, we only 
included loci that can be reliably aligned between all three 
P. aurelia species. Under these arbitrary thresholds, we 
retrieved nine lES-associated regions that are differentially ex- 
cised between P. aurelia species (supplementary table S6, 
Supplementary Material online). These regions contain two 
imperfectly excised lESs, and seven cryptic lESs, amounting 
to 0.12% of all the imperfectly excised lESs and 0.20% of 
all the cryptic lESs detected in our study. 

Finally, we asked how many of the approximately 45,000 
germline-specific sequences in P. tetraurelia are differentially 
excised between the surveyed species. We found that at least 
71 and 15 of these lESs are found in the macronuclear 
genome assembly of P. biaurelia and P. sexaurelia, respectively 



(supplementary table S7, Supplementary Material online). 
Remarkably, approximately 50% of these differentially excised 
lESs reside within coding sequences (supplementary table S8, 
Supplementary Material online). It is worth noting that as the 
orthology relationships between the examined species are cur- 
rently under investigation, the (BLAST-mediated) identification 
of differential lES excision events in our study relies on the 
levels of between-species sequence conservation of lES-flank- 
ing regions. 

Preliminary Characterization of lESs That Are Differentially 
Excised between P. aurelia Species 

The nine lES loci that appear as differentially excised between 
P. aureiia species correspond to regions that in P. tetraurelia 
are annotated as exonic (n = 5), intronic (n = 1), and inter- 
genic (n = 3) (supplementary table S6, Supplementary 
Material online). Although the exact structure of the genes 
in P. biaurelia and P. sexaurelia is currently unknown, the 
location of variably excised lESs hints at the occurrence of 
between-species changes in protein primary structure and. 
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Fjg. 5. — Putative example of a mutational event triggering erroneous excision of a macronuclear DNA region. A T A mutational change (indicated by 
the arrow) presumably triggers the excision of a macronuclear-destined sequence (part of the gene model GSPATG00029756001 ) in Paramecium tetraurelia. 
The alignment includes all detectable homologous sequences in the surveyed Paramecium species (BLAST E value = 1 0^^). The unrooted NJ-tree is based on 
regions that immediately flank the lES under examination (1 14 sites). Bootstrap values lower than 75% are not shown. Estimates of sequencing coverage are 
provided for each of the scaffold regions that are shown in the alignment: 13x (11 ungapped and 2 gapped reads) for P. fefraure/Za scaffold 112; lOx and 
19x (all ungapped reads) for P. tetraurelia scaffold 19 and scaffold 75, respectively; 6x and 12x (all ungapped reads) for P. biaurelia scaffold00658 and 
scaffold00308, respectively; lOx and 17x (all ungapped reads) for P. sexaurelia scaffold2033 and scaffold00049, respectively. 



perhaps, in levels of gene expression. With respect to possible 
changes in gene primary structure, it is worth noting that only 
one of the lESs mapping to coding exons has a size that is a 
multiple of 3. This implies that the retention/excision of the 
remaining non-3n lESs would lead to a shift in reading frame 
(as defined in P. tetraurelia), unless gene structure differs be- 
tween species and/or additional indel mutations maintain the 
reading frame. 

When compared with a family of homologous sequences 
within and across species (i.e., paralogous and orthologous 
loci), the insertion-deletion lES polymorphisms are typically 
restricted to only one member of the family, which suggests 
that these events are recent. In one set of sequence align- 
ments a differential IBS retention/excision event between para- 
logous regions was jointly detected for two species 
(supplementary fig. S2, Supplementary Material online). It is 
worth noting that absence of a putative lES from a genomic 
locus may indicate 1) that the lES is excised or 2) that the 
missing sequence is entirely absent from the micronuclear 
DNA. With our current data, we cannot discriminate between 
these two possibilities. 

Because of the central role played by the flanking TA dinu- 
cleotides in IBS excision, the mutational loss of TA dinucleo- 
tides around an lES is expected to prevent excision. 
Alternatively, the creation of sufficiently distant TA dinucleo- 
tides in an appropriate sequence context may lead to the 
erroneous recognition of the intervening sequence as an lES. 
As a putative example of this, we detected a TA-flanked IBS 
that aligns with TT-flanked macronuclear DNA sequences, 
within and across P. aurelia species (fig. 5). These TT-flanked 
macronuclear DNA sequences align with ungapped reads 



only. A parsimonious interpretation for this finding is that a 
macronuclear DNA region of P. tetraurelia has undergone ex- 
cision as a result of a mutational change (TjT ^ T| A) not found 
in other paralogs and orthologs. This macronuclear sequence 
has a size that is multiple of 3, and is part of an expressed 
P tetraurelia gene (GSPATG00029756001) coding for a pro- 
tein with unknown function. 

Discussion 

DNA splicing is one of the most intensively studied DRGRs, and 
particular attention has been devoted to the mechanisms in 
ciliates that effect the excision of IBSs (Mochizuki and 
Gorovsky 2005; Kowalczyk et al. 2006; Gratias et al. 2008; 
Lepere et al. 2008; Kapusta et al. 201 1). Bar less consideration, 
however, has been given to the evolutionary significance of 
this biological process. 

The availability of macronuclear genome sequences of clo- 
sely related species of the ciliate Paraivecium (Aury et al. 2006) 
(McGrath CL, Doak TG, Lynch M, unpublished data) provides 
an excellent opportunity for a comparative study, both to 
identify regions potentially involved in the regulation of DNA 
splicing and to explore the contribution of this process to the 
reshaping of a species' genome content. 

In this study, we identify a total of 5,286 putative IBSs 
across three moderately divergent species of the P. aurelia 
species complex, P. tetraurelia, P. biaurelia, and P. sexaurelia 
(Sonneborn 1975; Catania et al. 2009). The thorough macro- 
nuclear DNA puhfication procedure applied here, as in Aury 
et al. (2006), suggests that the vast majority of the IBSs iso- 
lated from the surveyed macronuclear DNA sequences result 
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from less-than-perfect lES excision, which takes place during 
the regeneration of a new macronuclear genonne at each 
sexual generation (Duret et al. 2008). 

We find that the imperfectly excised sequences share sev- 
eral similar features: 1) they are typically short (<100bp) 
(fig. 1 ) and AT-rich (>76%) (table 1 ); 2) they include imperfect 
inverted terminal repeats whose nucleotide frequencies and 
compositions are mostly conserved across species (fig. 2); 

3) they are immediately flanked by an excess of Gs and Cs 
(supplementary fig. SI, Supplementary Material online); and 

4) they are located in intergenic, intronic, and coding regions 
(table 1). 

The effect of natural selection on the evolution of these 
sequences is apparent for lESs that reside in genes. In partic- 
ular, fewer than expected 3n lESs are located within coding 
regions in P. tetraurelia, the only species whose genome is 
currently annotated. This deficit extends to the subset of 
lESs that are 3n in size and do not contain PTCs. A hypothesis 
for the latter observation maintains that PTC -free 3n lESs in- 
corporated into mature mRNAs would unsafely elude cellular 
surveillance systems. Thus, selection appears to operate on 
lESs that interrupt coding regions, eliminating those lESs that 
do not provide premature translation termination in the case 
of inaccurate excision (Duret et al. 2008). This scenario is both 
plausible and consistent with an additional observation of our 
study: lESs that introduce a PTC as a result of imperfect exci- 
sion accumulate at the 5'-end of gene transcripts (fig. 3). This 
preferential location of lESs corresponds to the region of the 
transcript where in other eukaryotes NMD most efficiently 
identifies aberrant mRNA transcripts (van Hoof and Green 
1 996; Silva et al. 2006; Longman et al. 2007) and more rapidly 
promotes termination of protein synthesis. Thus, our in silico 
observations suggest that the way the NMD pathway operates 
in Paramecium and in other multicellular eukaryotes is similar. 

The finding that the majority of lESs detected as imperfectly 
excised in one species are absent from the macronuclear 
genome assembly of other species is compatible with the 
hypothesis that the lES excision profile is largely conserved 
between closely related P. aurelia species. A nonmutually ex- 
clusive explanation for our observations, however, is that an 
lES detected in one species is not observed in other species 
simply because it is absent from their micronuclear genome. In 
other words, it is possible that many lESs are species specific. A 
formal test for this hypothesis requires a better understanding 
of the lES loss/gain dynamics within species and the nontrivial 
inference of orthologous genomic regions between the sur- 
veyed Paramecium species. 

We detected lES excision variability between and within 
P. aurelia species (fig. 4, supplementary table S5, Supple- 
mentary Material online). Our results indicate that a subset 
(0.20%) of the sequences that are incorrectly excised from 
only a few copies of the polyploid macronuclear genome of 
one particular P. aurelia species (i.e., cryptic lESs) may be faith- 
fully excised — or lost — from the macronuclear genome of 



other P. aurelia species. In the same way, lESs that are incor- 
rectly retained in only a few copies of the macronuclear 
genome of a P. aurelia species (i.e., imperfectly excised lESs) 
appear to be an integral part of the assembled macronuclear 
genome of other P. aurelia species (0.12%). The existence of 
differential events of lES excision/retention between P. aurelia 
species is consistent with the results of an additional analysis: 
when we query the entire set of approximately 45,000 lESs in 
the P. tetraurelia genome, we find that at least 86 lESs are 
differentially excised between species. It is worth noting that a 
more accurate quantification of differential lES excision events 
between species requires the availability of higher sequencing 
coverages (and thus a larger number of lESs) and completion 
of the P. biaurelia and P. sexaurelia assemblies. 

Manually curated analyses of a limited number of cases of 
putative differential lES excision/retention between the three 
P. aurelia species reveal that these events are 1) mostly unique 
(i.e., involve only one species and no other paralogs in that 
species); 2) flanked by a 5'-TA-3' dinucleotide; and 3) often 
located within regions annotated as coding exons in the 
P. tetraurelia genome. These observations suggest that the 
majority of alternative DNA rearrangements have arisen re- 
cently, at most after the intermediary whole genome duplica- 
tion, before the formation of the P. aurelia species complex 
(Aury et al. 2006). A compatible explanation is that genes with 
alternative rearrangements are frequently lost. Additionally, 
the location of these events implies that variable lES excision 
between species may, theoretically, alter both protein 
sequence and gene expression. 

Our study suggests that interconversion between macro- 
nuclear-destined DNA sequences and lESs takes place in 
Paramecium. This two-way process is reminiscent of the inter- 
conversion hypothesized for spliceosomal introns and lESs in a 
study on spirotrichous ciliates (Chang et al. 2007) and of a 
model advanced for the evolutionary origin of spliceosomal 
introns in eukaryotes (Catania and Lynch 2008). In the intro- 
nization model, exonic and intronic sequences undergo a pro- 
cess of mutual conversion that includes a transient phase 
where sequences are neither fully exonic nor fully intronic, 
an intrinsic property of sequences undergoing alternative splic- 
ing (Catania and Lynch 2008). The interesting similarity be- 
tween lESs and spliceosomal introns goes further, including a 
positional bias along genes (Catania and Lynch 2008; this 
study), the mechanism of selective cost (Lynch 2002; this 
study), and the potential role of these sequences in the gen- 
eration of new genomic information (Gao and Lynch 2009; 
this study), which could impinge on phenotype. 

All ciliates examined have lESs. Although it is not known if 
these lESs developed in the ancestral ciliate or evolved inde- 
pendently in different ciliates, it is reasonable that some of 
these sequences have had enough time to be co-opted for 
various other functions and to have developed specific selec- 
tive roles in some genes in some species. For example, imper- 
fect excision of lESs from coding sequence, when not harmful. 
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may alter the primary structure of a gene and perhaps the 
function of the coded protein. Also, the retention of lESs lo- 
cated in noncoding regions could alter (or create de novo) 
regulatory elements. 

Our study represents a first attempt to explore the existence 
of variability in the pattern of lES excision among closely re- 
lated species of ciliated protozoa. Despite the limitations of 
our approach in inferring the presence or absence of lESs in a 
genome, our findings are encouraging, providing clear imper- 
atives for future investigations, and compatible with a scenario 
in which DNA splicing contributes to the creation of poten- 
tially functional genetic innovation. 

Supplementary Material 

Supplementary figures SI and S2 and tables S1-S8 are avail- 
able at Genome Biology and Evolution online (http:/AAAAAA/. 
gbe.oxfordjournals.org/). 
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